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Abstract 

The convergence properties of the stationary Fokker-Planck algorithm for the esti- 
mation of the asymptotic density of stochastic search processes is studied. Theoret- 
ical and empirical arguments for the characterization of convergence of the estima- 
tion in the case of separable and nonseparable nonlinear optimization problems are 
given. Some implications of the convergence of stationary Fokker-Planck learning 
for the inference of parameters in artificial neural network models are outlined. 
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1 Introduction 



The optimization of a cost function which has a number of local minima is a 
relevant subject in all fields of science and engineering. In particular, most of 
machine learning problems are stated like oftenly complex, optimization tasks 
[lj. A common setup consist in the definition of appropriate families of models 
that should be selected from data. The selection step involves the optimization 
of a certain cost or likelihood function, which is usually defined on a high 
dimensional parameter space. In other approaches to learning, like Bayesian 
inference [TSUI], the entire landscape generated by the optimization problem 
associated with a set of models together with the data and the cost function 
is relevant. Other areas in which global optimization plays a prominent role 
include operations research [12], optimal design in engineenered systems [19] 
and many other important applications. 
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Stochastic strategies for optimization are essential to many of the heuristic 
techniques used to deal with complex, unstructured global optimization prob- 
lems. Methods like simulated annealing [T3"ll2T)fl?f2"o] and evolutionary popula- 
tion based algorithms [TUf7|22fllir2o] . have proven to be valuable tools, capa- 
ble of give good quality solutions at a relatively small computational effort. 
In population based optimization, search space is explored through the evolu- 
tion of finite populations of points. The population alternates periods of self - 
adaptation, in which particular regions of the search space are explored in an 
intensive manner, and periods of diversification in which solutions incorporate 
the gained information about the global landscape. There is a large amount of 
evidence that indicates that some exponents of population based algorithms 
are among the most efficient global optimization techniques in terms of com- 
putational cost and reliability. These methods, however are purely heuristic 
and convergence to global optima is not guaranteed. Simulated annealing on 
the other hand, is a method that statistically assures global optimality, but 
in a limit that is very difficult to acomplish in practice. In simulated anneal- 
ing a single particle explores the solution space through a diffusive process. 
In order to guarantee global optimality, the "temperature" that characterize 
the diffusion should be lowered according to a logarithmic schedule [8]. This 
condition imply very long computation times. 



In this contribution the convergence properties of an estimation procedure 
for the stationary density of a general class of stochastic search processes, re- 
cently introduced by the author [2|, is explored. By the estimation procedure, 
promising regions of the search space can be defined on a probabilistic ba- 
sis. This information can then be used in connection with a locally adaptive 
stochastic or deterministic algorithm. Preliminary applications of this density 
estimation method in the improvement of nonlinear optimization algorithms 
can be found in |23J . Theoretical aspects on the foundations of the method, its 
links to statistical mechanics and possible use of the density estimation pro- 
cedure as a general diversification mechanism are discussed in [3]. In the next 
section we give a brief account of the basic elements of our stationary density 
estimation algorithm. Thereafter, theoretical and empirical evidence on the 
convergence of the density estimation is given. Besides global optimization, 
the density estimation approach may provide a novel technique for maximum 
likelihood estimation and Bayesian inference. This possibility, in the context 
of artificial neural network training, is outlined in Section HI Final conclusions 
and remarks are presented in Section 
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2 Fokker— Planck learning of the stationary probability density of 
a stochastic search 



We now proceed with a brief account of the stationary density estimation 
procedure on which the present work is based. Consider the minimization of 
a cost function of the form V(xi, x 2 , x n , ...,xn) with a search space defined 
over L\ n < x n < L 2 n - A stochastic search process for this problem is modeled 
by 



OV 

x n = - o r-e(t), 

ox n 



where e(t) is an additive noise with zero mean. Equation ([T]), known as Langevin 
equation in the statistical physics literature [T7.26J, captures the essential 
properties of a general stochastic search. In particular, the gradient term gives 
a mechanism for local adaptation, while the noise term provides a basic diversi- 
fication strategy. Equation (pQ) can be interpreted as an overdamped nonlinear 
dynamical system composed by N interacting particles in the presence of ad- 
ditive white noise. The stationary density estimation is based on an analogy 
with this physical system, considering reflecting boundary conditions. It fol- 
lows that the stationary conditional density for particle n satisfy the linear 
partial differential equation, 



D *W(^)) +pW ^ = i . }) | = o (2) 



which is a one dimensional Fokker-Planck equation. An important conse- 
quence of Eq. (j2J) is that the marginal p(x n ) can be sampled by drawing points 
from the conditional p(x n \{xj^ n = x*}) via a Gibbs sampling [8j. Due to the 
linearity of the Fokker - Planck equation, a particular form of Gibbs sampling 
can be constructed, such that its not only possible to sample the marginal 
density, but to give an approximate analytical expression for it. From Eq. (|2"1) 
follows a linear second order differential equation for the cumulative distribu- 
tion y(x n \{x j7 L n = x*}) = X!^p(^nl{^¥n = Xj})dx' n , 

d 2 y + 1 dV dy _ Q 

dx n D dx n dx n 

y(L hn ) = 0, y(L 2 , n ) = 1. 
The boundary conditions 7/(L l n ) = 0, y{L 2>n ) = 1 came from the fact that 
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the densities are normalized over the search space. Random deviates can be 
drawn from the density p(x n \{xj^ n = x*}) by the inversion method [6], based 
on the fact that y is an uniformly distributed random variable in the interval 
y G [0, 1]. Viewed as a function of the random variable x n , y(x n \{xj^ n }) can 
be approximated through a linear combination of functions from a complete 
set that satisfy the boundary conditions in the interval of interest, 



L 

y{.x n \{xj-L n }) = J2 a m( x n)- (4) 

1=1 



Choosing for instance, a basis in which <pi(0) = 0, the L coefficients are 
uniquely defined by the evaluation of Eq. ([3]) in L — 1 interior points. In 
this way, the approximation of y is performed by solving a set of L linear alge- 
braic equations, involving L — 1 evaluations of the derivative of V. The basic 
sampling procedure, that we will call here Stationary Fokker-Planck (SFP) 
sampling, is based on the iteration of the following steps: 

1) Fix the variables Xj-L n = x* and approximate y{x n \{xj-^ n }) by the use of 
formulas ([3]) and (j4j). 

2) By the use of y(x n \{xj^ n }) construct a lookup table in order to generate a 
deviate x* n drawn from the stationary distribution p(x n \{xj^ n = x*}). 

3) Update * and repeat the procedure for a new variable Xj^ n . 

An algorithm for the automatic learning of the equilibrium distribution of the 
diffusive search process described by Eq. (JTJ can be based on the iteration of 
the three steps of the SFP sampling. A convergent representation for p(x n ) 
is obtained after taking the average of the coefficients a's in the expansion 
(jlj) over the iterations. In order to see this, consider the expressions for the 
marginal density and the conditional distribution, 



PM = / P{Xn\{Xj^n})p{{Xj^n})d{Xj^ n }, (5) 



y{Xn\{Xj?n}) = / P{x' n \{x^ n })dx n . (6) 



From the last two equations follow that the marginal y(x n ) is given by the 
expected value of the conditional y(x n \{xj^ n }) over the set {xj^ n }, 
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y(x n ) = E {x ^ n} [y(x n \{xj^ n })}. 



(7) 



All the information on the set {xj^ n } is stored in the coefficients of the ex- 
pansion Therefore 

L 

(V) = ^(ai)<Pl( x n) -> y(Xn), (8) 

1=1 

where the brackets represent the average over the iterations of the SFP sam- 
pling. 



3 Convergence of stationary Fokker— Planck learning 

The marginals p(x n ) = dy(x n ) / dx n give the probability that a diffusive particle 
be at any region x n dx n inside the search interval [L l n , L 2in ], under the action of 
the cost function. Convergence of the stationary density estimation procedure 
depends on: 

i) The existence of the stationary state. 

ii) Convergence of the SFP sampling. 

Conditions for the existence of the stationary state for general multi-dimensional 
Fokker-Planck equations can be found in [T7]. For our particular reflecting 
boundary case, in which the cost function and the diffusion coefficient do not 
depend on time, the basic requirement is the absence of singularities in the 
cost function. 

By the evaluation of Eq. (JSJ) at each iteration of a SFP sampling the station- 
ary density associated with the stochastic search can be estimated, and the 
accuracy of the estimate improves over time. We call this procedure a Sta- 
tionary Fokker-Planck Learning (SFPL) of a density. The convergence of the 
SFPL follows from the convergence of Gibbs sampling. It is known that under 
general conditions a Gibbs sampling displays geometric convergence (THE]- 
Fast convergence is an important feature for the practical value of SFPL like 
a diversification mechanism in optimization problems. The rigorous study of 
the links between the geometric convergence conditions (stated in [18] as con- 
ditions on the kernel in a Markov chain) with SFPL applied on several classes 
of optimization problems, should be a relevant research topic. At this point, 
some numerical experimentation on the convergence of SFPL is presented. 
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In what follows, the specific form of the expansion (J4]) 

^e^((2'-i) 5" £ ; ib I | o) 

1=1 \ ^[^2,n - -^l,n) / 

is used. 

The estimation algorithm converges in one iteration for separable problems. A 
separable function is given by a linear combination of terms, where each term 
involves a single variable. Separable problems generate an uncoupled dynamics 
of the stochastic search described by Eq. ([I]). This behavior is illustrated by the 
minimization of the Michalewicz's function, a common test function for global 
optimization algorithms [5]. The Michalewicz's function in a two dimensional 
search space is written as 

V(x u x 2 ) = - sinxi(sin(x 2 /7r)) 2m - sina; 2 (sin(2a; 2 /7r)) 2m (10) 

The search space is < x n < n. The Michalewicz's function is interesting as a 
test function because for large values of m the local behavior of the function 
gives little information on the location of the global minimum. For m = 10 
the global minimum of the two dimensional Michalewicz's function has been 
estimated has V ~ —1.89 and is roughly located around the point (2.2, 1.5), 
as can be seen by plotting the function. 

The partial derivatives of function ffTO]) with m = 10, have been evaluated 
for each variable at L — 1 equidistant points separated by intervals of size 
h = n/L. The resulting algebraic linear system has been solved by the LU 
decomposition algorithm [21]. In Fig. ([T]), Fig. ([2]) and Fig. ([3]) the functions 
y(xi) and y(x2) and their associated probability densities are shown. The 
densities have been estimated after a single iteration of SFPL. The densities 
p(xi) and p(x2) are straightforwardly calculated by taking the corresponding 
derivatives. In Fig. ([T]) with D = 1 and L = 5 is considered, while 

in Fig. (|2"1) D = 1 and L = 10. In Fig. a smaller randomness parameter 
is considered ( D = 0.4 ), using L = 20. Notice that even when D is high 
enough to allow an approximation of y with the use of very few evaluations of 
the derivatives, the resulting densities will give populations that represent the 
cost function landscape remarkably better than those that would be obtained 
by uniform deviates. 

The asymptotic convergence properties of the SFPL are now experimentally 
studied on the XOR optimization problem, 
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The XOR function is an archetypical example that displays many of the fea- 
tures encountered in the optimization tasks that arise in machine learning. 
This is a case with multiple local minima [2T] and strong nonlinear interac- 
tions between decision variables. In the experiment reported in Fig. H] and 
Fig. [5j two independent trajectories are followed over successive iterations. 
The parameters of the SFP sampler are D = 0.01 and L = 200. In Fig. |4]are 
reported the cost function values at the coordinates in which the marginals 
are maximum. For each trajectory, an initial point is uniformly drawn from 
the search space. As can be seen, both trajectories converge to a similarly 
small value of the objective function. The average cost function value, which 
is estimated by the evaluation of the cost function on 100 points uniformly 
distributed over the search space, is 2. After 280 iterations, the differences be- 
tween both trajectories are around 0.05% of the average cost function value. 
Moreover, the differences in objective value of the trajectories with respect to 
a putative global optimum 



f(x*) = 0.00026, (11) 
x* = (8.22885,-8.47952,-9.87758,9.10184, 
-4.55215, -5.05978, 9.98956, 9.96857, -4.91623), 



is < 0.117% of the average cost after the iteration 280. The putative global 
optimum in the search interval has been found by performing local search via 
steepest descent from a population of points draw from the estimated density. 

In order to check statistical convergence, the following measures are intro- 
duced, 



N 



av 



N 



J2 



n=l 
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Fig. 1. Evaluation of y and p by one iteration of the SFPL algorithm for the 
Michalewicz's function, using L = 5 and D = 1. Despite the very low number of 
gradient evaluations used, the algorithm is capable to find a probability structure 
that is consistent with the global properties of the cost function. 

1 N I 2 

JV 71=1 



where the brackets in this case represent statistical moments of the estimated 
marginals. Under the expansion (Q, all the necessary integrals are easily per- 
formed analytically. 

In the first graph of Fig. [5l the evolution over iterations of the SFP sampler of 
s and av for two arbitrary and independent trajectories is plotted. A very fast 
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Fig. 2. The same case reported in Fig. (pQ), but using L = 10. 

convergence in the measure av is evident. The measure s is further studied in 
the second graph of Fig. [5l where the difference on that measure among the 
two trajectories is followed over iterations. The convergence is consistent with 
a geometric behavior over the first (~ 100) iterations and shows an asymptotic 
power law rate. 



4 Maximum Likelihood Estimation and Bayesian Inference 

Besides its applicability like a diversification strategy for local search algo- 
rithms, the fast convergence of the SFPL could be fruitful to give efficient 
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x 

Fig. 3. Evaluation of y and p by one iteration of the SFPL algorithm for the 
Michalewicz's function. In this case L = 20 and D = 0.4. With the increment 
in precision and the reduction of the randomness parameter, SFPL finds a proba- 
bility density that is sharply peaked around the global minimum. Notice that the 
computational effort is still small, involving only 19 evaluations of the gradient. 

aproaches to inference, for instance in the training of neural networks. From 
the point of view of statistical inference, the uncertainty about unknown pa- 
rameters of a learning machine is characterized by a posterior density for the 
parameters given the observed data [T6yT4"] . The prediction of new data is 
then performed either by the maximization of this posterior (maximum like- 
lihood estimation) or by an ensemble average over the posterior distribution 
(Bayesian inference). To be specific, suposse a system which generates an out- 
put Y given an input X, such that the data is described by a distribution with 
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Fig. 4. Objective value of the point in which the marginals of the estimated density 
are maximum for two independent trajectories. 

first moment = f(X,w) and diagonal covariance matrix a 2 I. The 

problem is to estimate / from a given set of observations S. The parameters 
could be, for instance, different neural network weights and architectures. The 
observed data defines an evidence for the different ensemble members, given 
by the posterior p(w\S). In maximum likelihood estimation, training consists 
on finding a single set of optimal parameters that maximize p(w\S). Bayesian 
inference, on the other hand, is based on the fact that the estimator of / that 
minimizes the expected squared error under the posterior is given by [16] 



so training is done by estimating this ensemble average. 

In the SFPL framework proposed here, priors are always given by uniform 
densities. This choice involves very few prior assumptions, regarding the as- 
signment of reasonable intervals on which the components of w lie. Under an 
uniform prior, and if the data present in the sample has been independently 
drawn, it turns out that the posterior is given by 



where D = 2a 2 and V is the given loss function. The SFPL algorithm can 
be therefore directly applied in order to learn the marginals p(w n \S) of the 
posterior (fl~4"]) . By construction, these marginals will be properly normalized. 

It is now argued that SFPL can be used to efficiently perform maximum likeli- 
hood and Bayesian training. Consider again the XOR example. The associated 




(13) 



p(w\S) oc exp(—V/D) 



(14) 




iterations 



Fig. 5. Statistical convergence of the stationary density estimation procedure on the 
XOR problem. The value of the average first moment and standard deviation of the 
estimated marginals from two independent trajectories is plotted in the left. The 
graph on the right shows the distance between both average standard deviations. 
This distance decay at a geometric rate over the first ~ 100 iterations. Asymptot- 
ically the distance behave like a power law characterized by |si — sg] ~ M -0 67 , 
where M is the number of iterations. 

density has been estimated assuming a prior density for each parameter w n 
over the interval [—10, 10]. The posterior density, on the other hand, is a con- 
sequence of the cost function given the set of training data. In Section [3] it has 
been shown that for nonseparable nonlinear cost functions like in the XOR 
case, SFPL converges to a correct estimation of the marginal densities p(w n ). 
Therefore, the maximization of the likelihood is reduced to N line maximiza- 
tions, where N is the number of weights to be estimated. The advantage of 
this procedure in comparision with the direct maximization of p(w\S) is evi- 
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dent. On the other hand, the SFP sampler itself is designed as a generator of 
deviates that are drawn from the stationary density. The average (fi"3l) can be 
approximated by 



Y = (f(X)) « £ f(X, «,<«>), (15) 



without the need of direct simulations of the stochastic search, which is nec- 
essary in most of other techniques [16j. 

In Fig. [6] is shown the behavior of p(w n ) for a particular weight as the sample 
size increases. The parameters L = 200 and D = 0.01 are fixed. The two dotted 
lines correspond to cases with sample sizes of one and two, with inputs (0, 0) 
and (0,0), (1,1) respectively. The resulting densities are almost flat in both 
situations. The dashed line corresponds to a sample size of three. The sample 
points are (0,0), (1, 1), (0, 1). In this case the sample is large enough to give 
a sufficient evidence to favor a particular region of the parameter domain. 
The solid line corresponds to the situation in which all the four points of 
the data set are used for training. The resulting density is the sharpest. The 
parameter D is proportional to the noise strength in the stochastic search. It 
can be selected on the basis of a desired computational effort, as discussed in 
[3]. Figure [6] indicates that at a fixed noise level D, an increase of evidence 
imply a decrease on the uncertainty of the weights. This finding agrees with 
what is expected from the known theory of the statistical mechanics of neural 
networks [15], according to which the weight fluctuations decay as the data 
sample grows. 



The performance of maximum likelihood and Bayesian training is reported in 
Fig. [71 using the complete sample for the inference of the weights. The standard 
deviation of the error of the networks instantiated at the inferred weigths is 
reported at each iteration. The solid line corresponds to maximum likelihood 
training, which is essentially the same calculation already reported on Fig. 
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Fig. 6. The probability density of a particular weight (ws, a bias of one of the 
neurons in the hidden layer) of the ANN model for the XOR problem. The dotted 
lines correspond to cases with sample sizes of one and two. The dashed line is for 
the density that results from a sample of size three while the case for a sample size 
of four is given by the solid line. 
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Fig. 7. Performance of maximum likelihood (solid) and Bayesian (dashed) training 
for the XOR problem, using stationary Fokker-Planck learning to estimate weight 
distributions. 



IU The performance is very similar for Bayesian training, which corresponds 
to the dashed line. The estimation of the average (|T5l) as been performed by 
evaluating the neural network on a weight vector drawn by the SFP sampler 
at each iteration. In this way, the number of terms in the sum of Eq. (|T5|) is 
equal to the number of iterations of the SFP sampler. 
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Fig. 8. The difference in average cost function value for two independent trajectories 
in the robot arm problem. The ANN as been trained using 200 sample points. 

A larger example involving noisy data is now presented. Consider the "robot 
arm problem", a benchmark that has already been used in the context of 
Bayesian inference [16] . The data set is generated by the following dynamical 
model for a robot arm 



yi = 2.0cos(xi) + 1.3cos(xi + x?) + e± (16) 
?/2 = 2.0sin(xi) + 1.3sin(xi + X2) + e 2 

The inputs x\ and x 2 represent joint angles while the outputs y\ and y 2 give 
the resulting arm positions. Following the experimental setup proposed in |16j . 
the inputs are uniformly distributed in the intervals x\ G [—1.932, —0.453] U 
[0.453, 1.932], £2 £ [0.534,3.142]. The noise terms ei and e 2 are Gaussian and 
white with standard deviation of 0.1. A sample of 200 points is generated us- 
ing these prescriptions. A neural network with one hidden layer consisting on 
16 hyperbolic tangent activation functions is trained on the generated sam- 
ple using SFP learning, considering a squared error loss function. The same 
priors are assigned to all of the weights: uniform distributions in the interval 
[—1,1]. The average absolute difference in training errors for two independent 
trajectories is shown on Fig. [8] for a case in which L = 300, D = 0.00125 
and M = 300 iterations. Taking into account that the expected equilibrium 

D/2, it turns out that the differences between both 
trajectories are of the same order of magnitude as the expected equilibrium 
error in about 10 iterations. During the course of the total of 300 iterations 
of SFP of one of the trajectories, the network as been evaluated in the test 
input (—1.471,0.752). The resulting Bayesian prediction is shown on Fig. [9]in 
the form of a pair of histograms. The output that would be given by the exact 
model (jTBT) in the absence of noise is (1.177,-2.847). The Bayesian predic- 
tion given by SFPL has its mean at (1.24, —2.64) with a standard deviation 
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Fig. 9. Bayesian predictions for the robot arm problem in the test input 
(-1.471,0.752). The SFPL parameters are L = 300, D = 0.00125 and M = 300, 
using 200 sample points. 

of ~ 0.12 for each variable. Therefore the output given by the underlying 
model is contained in the 95% confidence interval around the Bayes expec- 
tation. Consistent predictions can also be obtained under less precision and 
data. In Fig. [TO] are shown the histograms obtained for with a sample 

size of 50 points, L = 200 and D = 0.01. Altough the Bayes prediction is more 
uncertain, it still is statistically consistent with the underlying process. 



In the robot arm example the number of gradient evaluations needed to ap- 
proximately converge to the equilibrium density in the L = 300 case was 
about 2(L — 1)M ~ 5980. This seems competitive with respect to previous ap- 
proaches, like the hybrid Monte Carlo strategy introduced by Neal. The reader 
is referred to Neal's book [16] in order to find a very detailed application of 
hybrid Monte Carlo to the robot arm problem. An additional advantage of 
the SFPL method lies on the fact that explicit expressions for the parameter's 
densities are obtained. 

Much more detailed experimentation is under current development. Additional 
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Fig. 10. Same case as in Fig. [9] but with SFPL parameters given by L 
D = 0.01, M = 300 and sample size of 50 points. 



200, 



studies regarding issues like generalization of more complex ANN models under 
a limited amount of data, in the spirit of the general framework for Bayesian 
learning [HUH], is currently a work in progress by the author. 



5 Conclusion 



Theoretical and empirical evidence for the characterization of the convergence 
of the density estimation of stochastic search processes by the method of sta- 
tionary Fokker-Planck learning as been presented. In the context of nonlinear 
optimization problems, the procedure turns out to converge in one iteration 
for separable problems and displays fast convergence for nonseparable cost 
functions. The possible applications of stationary Fokker-Planck learning in 
the development of efficient and reliable maximum likelihood and Bayesian 
ANN training techniques have been outlined. 
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